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Abstract. The advent of "next-generation" DNA sequencing (NGS) 
technologies has meant that collections of hundreds of millions of DNA 
sequences are now commonplace in bioinformatics. Knowing the longest 
common prefix array (LCP) of such a collection would facilitate the rapid 
computation of maximal exact matches, shortest unique substrings and 
shortest absent words. CPU-efficient algorithms for computing the LCP 
of a string have been described in the literature, but require the presence 
in RAM of large data structures. This prevents such methods from being 
feasible for NGS datasets. 

In this paper we propose the first lightweight method that simultane- 
ously computes, via sequential scans, the LCP and BWT of very large 
collections of sequences. Computational results on collections as large as 
800 million 100-mers demonstrate that our algorithm scales to the vast 
sequence collections encountered in human whole genome sequencing ex- 
periments. 



1 Introduction 

The longest common prefix array (LCP) of a string contains the lengths of the 
longest common prefixes of the suffixes pointed to by adjacent elements of the 
suffix array (SA) of the string [TJ] . The most immediate utility of the LCP is to 
speed up suffix array algorithms and to simulate the more powerful, but more 
resource consuming, suffix tree. When combined with the suffix array or the 
Burrows- Wheeler transform (BWT) of a string the LCP facilitates, among other 
things, the rapid search for maximal exact matches, shortest unique substrings 
and shortest absent words |14l5ll0irj . Existing algorithms for computing the 
LCP require data structures of size proportional to the input data to be held in 
RAM, which has made it impractical to compute the LCP of massive datasets 
such as the collections of hundreds of millions of reads produced by so-called 
Next-Generation Sequencing (NGS) technologies. 

In this context, the aim of our paper is designing an algorithm for the com- 
putation of the LCP of large collections of strings which works on an external 



memory system, by performing disk data accesses only via sequential scans, and 
is lightweight in the sense that its working space requirement is very low. 

Computing the LCP of a collection of strings has been considered in the 
literature [T7]. Defining A'' and K as respectively the sum of the lengths of 
all strings and the length of the longest string in the collection, the described 
approach requires 0{N\ogK) time, but the 0(A^log A^) bits of internal memory 
needed to store the collection and its SA in internal memory prevents the method 
from scaling to massive data. 

One can note that several algorithms to compute the LCP of a single string 
in semi-external memory (see for instance [TT]) or directly via BWT (see [5]) 
could be adapted to solve the problem of computing the LCP of a collection of 
strings. It could be sufficient to concatenate all the members of the collection 
into a single string and use distinct end marker symbols as separators. However, 
assigning a different end marker to each string is not feasible when the collection 
is very large, but the alternative of terminating each member with the same 
symbol could lead to LCP values that exceed the lengths of the strings and that 
depend on the order in which the strings are concatenated. In our approach, we 
compute the LCP of the collection directly from the strings, without needing 
to concatenate them and without requiring precomputed auxiliary information 
such as the SA or BWT of the collection. 

In fact, building upon the method of BWT computation introduced in |2,, our 
algorithm adds some lightweight data structures and allows the LCP and BWT 
of a collection of m strings to be computed simultaneously in 0{{'m-\-a'^)\og{N)) 
bits of memory, with a worst-case time complexity of 0{K{m + sovi(rn))), where 
sort(m) is the time taken to sort m integers, a is the size of the alphabet, A^ is 
the sum of the lengths of all strings and K is the length of the longest string. 

The low memory requirement enables our algorithm to scale to the size of 
dataset encountered in human whole genome sequencing datasets: in our exper- 
iments, we compute the BWT and LCP of collections as large as 800 millions 
100-mers. 

Section [5] gives preliminaries that we will use throughout the paper, whereas 
Section [3] describes the sequential computation of the LCP. We present details 
on the efficient implementation of the algorithm and computational results on 
real data in Sections |4] and El respectively. 

2 Preliminaries 

Let S = {ci, C2, . . . , Cct} be a finite ordered alphabet with ci < C2 < ■ ■ ■ < c^, 
where < denotes the standard lexicographic order. We append to a finite string 
w & S* an end marker symbol $ that satisfies S < ci. We denote its characters 
by w[0], it;[l], . . . , w[fc], where fc -|- 1 is the length of w, denoted by Note 
that, for i < k, w[i] G S and w[k] = $. A substring of a string w is written 
as j] = wli] ■ ■ -wlj], with a substring w[0,j] being called a prefix, while a 
substring w[i,k] is referred to as a suffix. 



The suffix array of a string w is an array SA containing the permutation of 
the integers 0. . . — 1 that arranges the starting positions of the suffixes of 
w into lexicographical order. There exist some natural extensions of the suffix 
array to a collection of sequences (see [12 )• We denote by S the collection of 
m strings {wq, wi, . . . , w„i-i}- We append to each sequence Wi an end marker 
symbol $i smaller than ci, and $j < $j if i < j. Let us denote by N the sum of 
the lengths of all strings in S. 

Let us denote by 5(pos,Soq) the suffix starting at the position Pos of the string 
wscq- We define the generalized suffix array GSA of the collection S as the array 
of N pairs (Pos, Seq), sorted by the lexicographic order of their corresponding 
suffixes 5(pos,Soq)- In particular, GS'A[i] = it,j) is the pair corresponding to the 
i-th smallest suffix of the strings in S. 

The longest common prefix array (denoted by LCP) of a collection iS of 
strings is an array storing the length of the longest common prefixes between two 
consecutive suffixes of S in the lexicographic order. For every j = 1, . . . , iV — 1, if 
GSA[j - 1] = (pi,P2) and GSA[j] = (91,92), LCP[j] is the length of the longest 
common prefix of suffixes starting at positions pi and qi of the words Wp^ and 
Wq2, respectively. We set LCP[0] = 0. 

Note that the generalization of the suffix array to a collection S of strings 
is related to an extension of the notion of the Burrows- Wheeler transform to a 
collection of strings that is a reversible transformation introduced in [12] (see 
also [13]). Actually, in its original definition, such a transformation produces a 
string that is a permutation of the characters of all strings in S but it does not 
make use of any end marker. 

In this paper we suppose that a different end marker is appended to each 
string of S. Let us denote by BWT{S) the Burrows- Wheeler transform of the 
collection S and its output is produced according to the generalized suffix array 
of S. In particular, if GSA\i] = {t,j) then BWT\i] = Wj[(t - l)mod|u;j|]. 

Note that the output of BWT{S) differs, for at least m symbols, from BWT 
applied to the string obtained by concatenating all strings in S. External memory 
methods for computing BWTiS) are given in [5]. 

3 LCP computation of a collection of strings via BWT 

The main goal of this section is to describe the strategy to compute the LCP of 
a massive collection of strings via sequential scans of the disk data. In particular, 
the main theorem of the section enables the simultaneous computation of both 
LCP and BWT of a collection S oi m strings {wq, wi, . . . , Wm-i}. We suppose 
that the last symbol of each sequence Wi is the end marker S^. Our method scans 
all the strings from right to left and both LCP and BWT are incrementally built 
by simulating, step by step, the insertion of all suffixes having the same length 
in the generalized suffix array. 

We refer to the suffix starting at the position \wi\ — j — 1 oi & string Wi as 
its j-suffix. With the end marker %i included, the j-suffix is of length j + 1; the 



O-sufftx contains $i alone. Let us denote by Sj the collection of the j-suffixes of 
all the strings of S. 

Let us denote by K the maximal length of the strings in S and by \cpj{S) 
the longest common prefix array of the collection Sj. It is easy to see that 
when j — K, lcp^ (5) coincides with the LCP of S. Since all m end-markers are 
distinct, the longest common prefix of any pair of the O-suffixes is 0, so the first 
m positions into \cpj{S) are for any j > 0. 

Note that \cpj{S) can be considered to be the concatenation of cr + 1 arrays 
Lj{Q), Lj{l), . . . ,Lj{(T) where, for h = 1, . . . ,(T, the array Lj{h) is the LCP of 
the suffixes of Sj that start with Ch G S, while Lj{Q) (corresponding to the 
O-suffixes) is an array of m zeroes. It is easy to see that lcpg(iS) = Lo{Q) and 
that Lolh) is empty for h > 0. For sake of simplicity, each segment is indexed 
starting from 1. We note that, for each < h < a, Lj{h)[l] = and Lj{h)[i] > 1 
for i > 1. 

Similarly, we define the string bwtj(5) as the Burrows- Wheeler transform 
of the collection of the j-suffixes of S. This can be partitioned in an analogous 
way into segments Bj{0), Bj{l), . . . , Bj{a), where the symbols in Bj{0) are the 
characters preceding the lexicographically sorted O-suffixes of Sj and the symbols 
in Bj{h), with h > 1, are the characters preceding the lexicographically sorted 
suffixes of 5j starting with Ch G Moreover, bwto(5) — Bo{0) and the segments 
Bij{h) are empty for h> Q. 

In this section we show that, for each j > 0, \cp j{S) can be sequentially 
constructed by using bwtj_i(iS) and \cpj_i{S) (in previous work [2], three of the 
present authors showed how bwtj(iS) may be computed from bwtj_i(5)). Note 
that bwto(iS) and lcpo(5) are defined above. 

Given the segments Bj (h) and Lj (h), ft, = 0, . . . , cr, for the symbol x occurring 
at position r of Bj{h) we define the (j, h)-LCP Current Interval of x in r (denoted 
by LCIj'{x,r)) as the range {di,r] in Lj{h), where di is the greatest position 
smaller than r of the symbol x in Bj (/i) , if such a position exists. If such a position 
does not exist, we define LCIj'{x,r) — Lj{h)[r]. Analogously, we define for the 
symbol x the {j, h)-LCP Successive Interval of a; in r (denoted by LSI^{x,r)) 
as the range (r, ^2] in Lj{h), where d2 is the smallest position greater than r of 
the symbol x in Bj{h), if it exists. If such a position does not exist we define 
LSIj'{x, r) = Lj{h)[r]. In our notation, a range is delimited by a square bracket 
if the correspondent endpoint is included, whereas the parenthesis means that 
the endpoint of the range is excluded. 

Actually, it is easy to verify that di — select(rank(a;, r) — l,x) and d2 = 
select(rank(a;, r) + 1, x), where rank(x, r) counts the number of x's until position 
r and select(p, x) finds the position of the p-th occurrence of a; in a segment Bj. 

The following theorem shows how to compute the segments Lj{h), with j > 
0, by using Lj^i{h) and Bj^i{h) for any h > 0. We denote by Sufj(O) the 
lexicographically sorted O-suffixes and by Suij{h), for h > 0, the lexicographically 
sorted t-suffixes of Sj, with t < j starting with c/j G S. 

Theorem 1. Let I — {rp < < ... < rq^i} be the set of the positions in 
Sufj{z) of the j-suffixes starting with the letter Cz- For each position rp ^ I 



(0<p< q), 

{Q j^j: ^ — I 

1 ifrl>l and iC/J_i(c,, t) = L,_i{v)[t] 

mmLCIj_i{c2,t) + 1 otherwise 

where c„ is the first character of the {j — \)-suffix of Wi, and t is the position in 
Bj-i(v) of symbol Cz preceding the [j — l)-suffix of Wi. 

For each position [rp + 1) ^ I (where rp and < p < q), then 

r + 11 - / ^ if LSIJ_,{cz,t) = L,^i{v)[t] 

i.j[Z)[rp + ij <^ min LSIJ^i{cz,t) + 1 otherwise 

For each position s, where 1 < s < (for p = 0), rp_i < s < rp (for 
< p < q — I), s > rp (for p = q — 1) then 

For lack of space, the proof of the theorem is omitted and we defer it in the 
full paper. 

A consequence of the theorem is that the segments Bj and Lj can be con- 
structed sequentially and stored in external files. This fact will be used in the 
next section. 



4 Lightweight implementation via sequential scans 

Based on the strategy described in the previous section, here we propose an 
algorithm (named extLCP) that simultaneously computes the BWT and LCP 
of a set of strings S. Memory use is minimized by reading data sequentially 
from files held in external memory: only a small proportion of the symbols of 
<S need to be held in RAM and we do not need to keep the generalized suffix 
array of S. Obviously, the generalized suffix array can be a side output of our 
implementation. 

Our method extends previous work [213] on computing the BWT of a collec- 
tion of strings and we follow the notation therein. 

Although our algorithm is not restricted to collections of strings of uniform 
length, for sake of simplicity our description supposes that iS comprises m strings 
of length k and we assume that j = 0,1, ... ,k and i = 0, 1, . . . , m — 1. We 
simulate m distinct end-markers by using a single end-marker $ = cq and setting 
Ws[k] < wt[k] if and only if s < t, so that if two strings Ws and wt share the 
j-suffix, then Ws[fc — j, k] < Wt[k — j, k] if and only if s < t. Moreover, we assume 
that the values of \cp j{S) do not exceed j and the first m positions into \cpj{S) 
are for any j > 0. 

The main part of the algorithm consists of k consecutive iterations. At itera- 
tion j, we consider all the j-su3ixes of iS and simulate their insertion in the GSA. 



In other words, for each i, we have to find the position of the suffix Wi[k — j, k] 
according to the lexicographic order of all the suffixes of S of length at most 
j, then insert the new symbol circularly preceding the j-suffix of Wi into Bj(z), 
where Cz = Wi[k — j], for some z — 1, . . . ,ct, and update the values in Lj{z). 
Consequently, both bwtj(5) and \cpj{S) are updated accordingly. Note that, at 
each iteration j, both the segments Bj and Lj, initially empty, are stored in 
different external files that replace the files used in the previous iteration. 

In order to compute hwtj{S) and \cpj{S), the algorithm needs to hold six 
arrays of m integers in internal memory. Four of these (Pj, Qj, Nj and Uj) are 
useful to compute the BWT (see a further two {Cj and Sj) are needed to 
compute and update the values of the longest common prefixes, we will give a 
description of all these arrays but, for brevity, we will focus on the computation 
of Cj and Sj . 

Each of the arrays Pj, Qj, Nj and Uj contains m elements, as detailed in the 
following. At the end of iteration j, if Wi[k — j, k] is the q-th j-suffix then: 

— Nj[q] contains the index i. It uses 0(m log m) bits of workspace. 

— Qj[q] stores the index z where = Wi[k — j], i.e. the first symbol of the 
j-suffix. It uses 0(m log a) bits of workspace. 

— Pj[q] contains the position in Bj{z) of the symbol circularly preceding the 
j-suffix Wi [k — j, k\ , such a symbol is Wi [k — j — 1] and it is stored at the 
position Nj[q\ of Uj. So, it needs 0(mlog(mfc)) bits of workspace. 

— Uj stores the new characters to be inserted, one for each sequence in 5, so 
it uses 0(rn log ct) bits of workspace. 

The arrays Cj and Sj each contain m integers useful to compute lcp^ (iS) for 
j > 0. In particular, Cj [q] stores the value in LCP between the j-suffix Wi [k—j, k] 
and the previous suffix in the GSA with respect to the lexicographic order of all 
the suffixes of S of length at most j, whereas Sj[q] contains the value in LCP 
between the j-suffix and the next suffix in GSA (if it exists). Such values will 
be computed at the iteration j — 1 according to Theorem [TJ We observe that Cj 
and Sj contain exactly one integer for each sequence in the collection and they 
use 0{m log k) bits of workspace. 

At the iteration j = 0, the algorithm initializes the segments Bq and Lq as 
described in previous section, i.e. Bq{0) = wo[k — l]wi[k — 1] ■ ■ ■ Wm-i[k — 1] and 
for each g = 0, . . . , m — 1, we set Lq{0) [q] — 0. Consequently, for q = 0, . . . , m — 1, 
the arrays are initialized by setting Nf)[q] — q, PQ[q] — q+1^ QoM = 0, Ci[q] — 1 
and Si[q] = 1. 

For I/O efficiency, each iteration j > can be divided into two consecutive 
phases: during the first one we read only the segments Bj-i in order to find 
the arrays Pj, Qj Nj and Uj. In phase 2, the segments Sj-i and are read 
once sequentially both for the construction of new segments Bj and Lj and for 
the computation of the arrays C^+i and 5^+1, as they will be used in the next 
iteration. In the following we describe both the phases of the generic iteration 
j > 0. Figure [T] illustrates the execution of the algorithm for a simple collection 
at the iterations 12 and 13. 



In the first phase the arrays Pj, Qj and Nj are computed. In particular, 
if Wilk — J — 1] (or the end marker $ for the last step) is the new symbol to 
be inserted, its position r is obtained by computing the number of occurrences 
of Cz = Wi[k — j] in i?j_i(0), . . . , — 1) and in _Bj_i(t;)[l, t], where Cy = 

Wi[k — (j — 1)] and t is the position of Cz in Bj^i{v). Hence, the index z is stored 
into Qj at some position q, the computed position r (where storing the new 
symbol) is added to the array Pj [q] and i is added to Nj [q] . Note that in order 
to find the positions, a table of 0{a^ log(mfc)) bits of memory is used. Finally 
we sort Qj , Pj , Nj , Cj , Sj where the first and the second keys of the sorting are 
the values in Qj and Pj respectively. 

Here we focus on the second phase in which the computation of the segments 
Lj is performed by using the arrays Cj and Sj constructed during the previous 
step. Note that the sorting of the arrays allows us to open and sequentially read 
the pair files {Bj^i{h) and Lj^iih) for = 0, . . . , a) at most once. 

For all symbols in Uj that we have to insert in the segment Bj{h), the crucial 
point is to compute Cj+i by using LCl!^ and Sj+i by using LSIj while the new 
files are being constructed, instead of using auxiliary data structures to compute 
rank and select. For each index z, we consider all the elements in Qj equal to 
z. Because of the sorting, such elements are consecutive. Let 0<l,l'<m — 1 
be their first and the last positions, respectively. Hence for each I < p < I', 
we have Qj[p] = z and Pj[l] < . . . < Pj[l']. In order to apply Theorem [U we 
need to compute LCI^ and LSIj of each new symbol in its new position. Since 
each Bj(h) and Lj{h) are constructed sequentially, we do not know a priori the 
opening positions LCIj and the closing positions LSIj that are used to compute 
Cj+i and Sj+i. However we can observe that when we write a symbol x into 
Bj{h), its occurrence could be the opening or closing positions of some LCIj" 
and LSIj of x, if a; is a new symbol. Such considerations are outlined in detail 
in the following. 

For each symbol a that we insert at position s in Bj{z), with 1 < s < Pj[l], 
it is easy to see that Bj{z)[s] = Bj^i{z)[s] and Lj{z)[s] = ij_i(z)[s]. Moreover, 
the position of a could be the opening position of LCIj {a, y), if a is the new 
symbol that will be inserted at some next position y. 

For each new symbol /3 that we insert at position Pj[q] in Bj{z) {I < q < I'), 
we have [3 = Uj[Nj[q]] and, by Theorem [TJ it follows that Lj{z)[Pj[q]] = if 
Pj[q] = 1 or Lj{z)[Pj[q]] = Cj[q] otherwise. Moreover: 

— The position Pj[q] surely is the closing position of LCIj{f3,Pj[q]). If the 
position Pj[q] is the first occurrence of /3 in Bj{z), then LCIj{l3,Pj[q\) = 
Lj{z)[Pj[q]] and we set Cj+i[q] = Lj{z)[Pj[q]] + 1 according to Theorem [TJ 
Otherwise, we set Cj+i[q] — mm{LCIj{(3,Pj[q])) + 1, whose computation 
has been started when the interval was opened. 

— The position Pj[q] could be the opening position of LCIj {(3, y), ii (3 will be 
inserted, as new symbol, at some next position y. 

— The position Pj[q] could be the closing position of LSIj{f3,y), where y 
represents, eventually, the largest position Pj[f], with Pj[f] < Pj[q], for 
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Fig. 1. Iteration 12 (on tiie left) and iteration 13 (on the right) on the collection 
S = {AC ACT GT AC C A AC, CAACAGAAAGCTC}. We append diflerent end-marker 
to each string ($0 and $1, respectively) to make the explanation more immediate but 
the same situation would occur using the same symbol. The first two columns represent 
the partial LCP and the partial BWT after the iterations. The positions of the new 
symbols corresponding to the 13-suffixes (shown in bold on the right) are computed 
from the positions of the 12-suffixes (in bold on the left), which were retained in the 
array P after the iteration 12. The new values in LCP (shown in bold on the right) 
are computed during the iteration 12 and are contained in C12. The updated values in 
LCP (shown in bold and underlined on the right) are computed during the iteration 
12 and are contained in 512. 



I < f < q, where f3 has been inserted. In this case, we set Sj+i[f] = 
mm{LSIj{l3, Pj[q])) + 1 in according with Theorem [T] 
— The position Pj[q] surely is the opening position of LSIj{f3,Pj[q]). We ob- 
serve that if the position Pj [q] is the last occurrence of (3 in Bj (z) (we will dis- 
cover this at the end of the file), it means that LSIj{(3, Pj [q]) = Lj{z)[Pj [q]], 
i.e. Sj+i[q] = 1. 

For each symbol a that we insert at position {Pj[q] + 1), with Pj[q\ + 17^ 
Pj[q-|-1], i3j(z)[Pj[g]-|-l] = Bj-i{z)[Pj[q~p]] (where p is the number of the new 
symbols already inserted) and, by Theorem [U Lj{z)[Pj[q] + 1] = Sj[q]. For each 



symbol a that we insert at position s in _Bj(2;), with Pj [g] < s < Pj[q+1] {I < q < 
I'), we have Bj{z)[s] — Bj{z)[s~p] and, by Theorem[Tl Lj{z)[s] — Lj^i{z)[s—p], 
where p is the number of the new symbols already inserted. Moreover: 

— The position s could be the opening position of LCIj {a,y), if a will be 
inserted, as new symbol, at some next position y. 

— The position s could be the closing position of LSIj{a, Pj[f]), if a has 
been inserted, as new symbol, at some previous position Pj[f], with Pj[f] < 
Pj[q], for Z < / < g. In this case, we set Sj+i[f] = min(L5/|(a, Fj [/])) + 1 
according to Theorem [TJ 

For each symbol a that we insert at the position s in Bj{z), where s > Pj[l'], 
we have Bj{z)[s] = Bj{z)[s — {I' — I + 1)] and, by Theorem [U Lj{z)[s] = 
Lj-i{z)[s — {I' — I + 1)]. Moreover, the position of a could be the closing posi- 
tion of LSIj{a, Pj [/]), if a has been inserted as a new symbol at some position 
Pj[f], for I < f < I'. In this case, we set Sj+i[f] = min(LS'//(a, P,- [/])) + 1 in 
according with Theorem [TJ 

When Bj{z) is entirely built, the closing position of some LSIj{a,y) could 
remain not found. This means that the last occurrence of a appears at position 
y. Note that y must be equal to some Pj[f], I < f < V ■ ^ri this case, we set 
Sj+i[f] = 1 according to Theorem [TJ 

It is easy to verify that we can run these steps in a sequential way. Moreover, 
one can deduce that, while the same segment is considered, for each symbol 
a £ S aX most one LCIj{a, t) for some t, and at most one LSIj{a, r) for some 
r < t, will have not their closing position. For this reason we use two arrays 
minLCI and minLSI of a integers that store, for each symbol a in U, the 
minimum among the values of LCP in the possible corresponding LCI or LSI 
without closing position, respectively. 

From the size of the data structures and from the above description of the 
phases of the extLCP algorithm, we can state the following theorem. 

Theorem 2. Given a collection S of m strings of length k over an alpha- 
bet of size a, the extLCP algorithm computes BWT and LCP of S by using 
0{mk^ \oga) disk I/O and 0{{m + (T^)log(mfc)) hits of memory in 0{k{m + 
sort{m)) CPU time, where sort{m) is the time taken to sort m integers. 

The following corollary describes the performance of the method when the 
collection contains strings of different length. 

Corollary 1. Given a collection S of m strings over an alphabet of size a, the 
LCP and BWT of S are computed simultaneously in 0{{m + a'^)log{N)) bits 
of memory, with a worst-case time complexity of 0{K{m + sort{m))), where 
sort{m) is the time taken to sort m integers, N is the sum of the lengths of all 
strings and K is the length of the longest string. 



5 Computational experiments and discussion 



To assess the performance of our algorithm on real data, we used a publicly avail- 
able collection of human genome sequences from the Sequence Read Archive [5] 
at |f tp : //ftp ■ sra . ebi . ac . uk/voll/ERA015/ERA015743/srf /| and created sub- 
sets containing 43, 85, 100, 200 and 800 million reads, each read being 100 bases 
in length. We developed extLCP, an implementation of the algorithm described 
in Section|4l which is available upon request from the authors. Our primary goal 
was to analyze the additional overhead in runtime and memory consumption of 
simultaneously computing both BWT and LCP via extLCP compared with the 
cost of using BCR ([5]) to compute the BWT alone. 

Table[T]shows the results for the instances that we created. We do see increase 
in runtime since extLCP writes the values of LCP after that the symbols in BWT 
are written, so it effectively increases the I/O operations. So, a time optimization 
could be obtained if we read/write at the same time both the elements in BWT 
and LCP by using two different disks. All tests except the 800 million read 
instance were done on the same machine, having 16Gb of memory and two quad- 
core Intel Xeon E5450 S.OGHz processors. Although the LCP of a collection of 
700 million 100-mers was successfully computed on the same machine using 15Gb 
of RAM, the collection of 800 million reads needed slightly more than 16Gb so 
was processed on a machine with 64Gb of RAM and four quad-core Intel Xeon 
E7330 2.4GIIz processors. On both machines, only a single core was used for 
the computation. Moreover, to examine the behaviour of our algorithm on reads 
longer than lOObp, we created a set of 50 million 200bp long reads based on the 
100 million lOObp instance. It turns out that, although the sheer data volume is 
the same, extLCP uses 1.2Gb and takes 10.3 microseconds per input base. 

Our algorithm represents the first lightweight method that simultaneously 
computes, via sequential scans, the LCP and BWT of a vast collection of se- 
quences. Recall that the problem of the LCP computation of a collection of 
strings has been faced in |17| . but such a strategy works in internal memory. 
Recently, however, some lightweight approaches for the LCP computation of a 
single string were described in the literature. Some of them use of the sufhx 
array of the string [11I7I15I9] , but the space needed to hold this in RAM is pro- 
hibitive for NGS datasets. However, in [5], the authors give an algorithm for the 
construction of the LCP of a string that acts directly on the BWT of the string 
and does not need its suffix array. A memory-optimized version of this algorithm 
[3] (called bwt_based_laca2) needs to hold the BWT of the string in internal 
memory plus a further 1.5n bytes, where n is the length of the input string. 

Notice that an entirely like-for-like comparison between our implementation 
and the above existing implementation for BWT and LCP computation of a 
string would imply the concatenation of the strings of the collection by different 
end markers. However, for our knowledge, the existing implementations do not 
support the many millions of distinct end markers our test collections would 
require. 

An alternative is to concatenate each of strings with the same end marker. 
This leads to values in the LCP that may possibly exhibit the undesirable prop- 
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Table 1. The input string collections were generated on an Illumina GAIIx sequencer, 
all reads are 100 bases long. Size is the input size in gigabytes, wall clock time — the 
amount of time that elapsed from the start to the completion of the instance — is given 
as microseconds per input base, and memory denotes the maximal amount of memory 
(in gigabytes) used during execution. The efficiency column states the CPU efficiency 
values, i.e. the proportion of time for which the CPU was occupied and not waiting for 
I/O operations to finish, as taken from the output of the /usr /bin/time command. 



erties of exceeding the lengths of the strings and depending on the order in which 
the strings are concatenated, but does allow the BWT of the resulting string to 
be computed in external memory by using the algorithm bwte proposed in [5]. 

The combined BWT/LCP computation provided by extLCP has a faster 
runtime than bwte. In particular, for the 0085M instance, bwte uses 14Gb of 
memory and needs 3.84 microseconds per input base vs extLCP that uses 2Gb 
of memory and 3.81 microseconds per input base. 

We have also used BCR by suitable preprocessing steps, to simulate the com- 
putation of the BWT of the concatenated strings. We compared BCR, extLCP 
and bwt_based_laca2 on the 0200M instance. Since the memory consumption 
of bwt_based_laca2 exceeded 16Gb on this dataset, we ran the tests on a ma- 
chine of identical CPU to the 16Gb machine, but with 64Gb RAM. 

With BCR, the BWT was created in under 5 hours of wallclock time taking 
only 4Gb of RAM, while bwt_based_laca2 required 18Gb of RAM to create 
the LCP in about 1 hour 45 minutes. Our new method extLCP needed 4.7Gb 
of RAM to create both BWT and LCP in just under 18 hours. Attempting to 
use bwt_based_laca2 to compute the LCP of the OSOOAf instance exceeded the 
available RAM on the 64Gb RAM machine. 

The experimental results show that our algorithm is a competitive tool for the 
lightweight simultaneous computation of LCP and BWT on the string collections 
produced by NGS technologies. Actually, the LCP and BWT are two of the three 
data structures needed to build a compressed suffix tree (CST) [TB] of a string. 
The strategy proposed in this paper could enable the lightweight construction of 
CSTs of strings collections for comparing, indexing and assembling vast datasets 
of sequences when memory is the main bottleneck. 



Our current prototype can be further optimized in terms of memory by per- 
forming the sorting step in external memory. Further saving of the working space 
could be obtained if we embody our strategy in BCRext or BCRext++ (see [3]). 
These methods, although slower than BCR, need to store only a constant and 
(for the DNA alphabet) negligibly small number of integers in RAM regardless 
of the size of the input data. 
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